feat(m4): variant quality metrics and near-duplicate deduplication #23
Merged
- extractor: add compute_ink_ratio, compute_dhash, hamming_distance
- schema: quality.ink_ratio is now required on every variant
- generator: embed quality.ink_ratio per variant; greedy dHash dedup pass collapses near-duplicates (Hamming ≤ 10) per letter, keeping highest ink_ratio; _dhash key stripped before writing letter_set.json
- examples: add quality.ink_ratio to all fixtures in writer_example.json
- tests: 20 new tests covering ink_ratio, dHash, hamming_distance, quality embedding, dedup logic (remove dupes, keep best ink_ratio, keep distinct glyphs, no _dhash leakage)
- docs: add docs/design/quality-and-dedupe.md design record

Closes #22

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
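The two simpler metrics named in this commit can be sketched as follows. This is a minimal illustration, not the code from extractor.py: the `(x, y, w, h)` bounding-box tuple stands in for the `glyph` argument, and treating nonzero pixels as ink is an assumption about the binarised array.

```python
import numpy as np


def ink_ratio(binary: np.ndarray, bbox: tuple[int, int, int, int]) -> float:
    """Fraction of ink pixels inside bbox = (x, y, w, h).

    Assumption: nonzero pixels in `binary` are ink. The real function takes
    a glyph object; a plain bbox tuple is used here for illustration.
    """
    x, y, w, h = bbox
    crop = binary[y:y + h, x:x + w]
    if crop.size == 0:
        return 0.0  # empty crop: report no ink instead of dividing by zero
    return float(np.count_nonzero(crop)) / crop.size


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two non-negative integer hashes differ."""
    return bin(a ^ b).count("1")
```

Both helpers need only NumPy and the standard library, which matches the later note that the ink-ratio computation has no OpenCV dependency.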
Correctness fixes:
- generator: write PNG files only after dedup, not before; eliminates orphaned files for variants dropped during near-duplicate clustering
- generator: preserve cluster-centre _dhash when updating representative with a higher-ink_ratio candidate; prevents greedy cluster from drifting across successive substitutions (A->B->C chain bug)
- generator: derive used_entry_ids and observed_licenses from post-dedup survivors only; entries/licenses contributed solely by deduped-out variants no longer appear in the manifest

Design fixes:
- extractor: remove spurious _require_cv2() from compute_ink_ratio; the function is pure NumPy arithmetic and has no OpenCV dependency
- extractor: vectorise compute_dhash with np.packbits instead of Python nested loop over individual numpy elements; also add hash_size >= 1 validation
- generator: update module docstring to describe M3/M4 pipeline steps

Observability:
- generator: emit GeneratorWarning when dedup drops variants, with letter and count; callers now have visibility into silent data loss

Test quality:
- test_generator: remove dead fake_binary MagicMock wiring in test_generate_mocked_variant_has_quality (was overridden by patch)
- test_generator: add 8 isolated unit tests for _dedup_letter_variants including threshold boundary, cluster-drift regression, single variant, and internal-key-not-stripped assertions
- test_generator: two integration dedup tests now use pytest.warns to assert the GeneratorWarning is actually emitted
- _dedup_letter_variants: no longer strips _dhash/_png_bytes; caller strips both after writing survivors (cleaner separation of concerns)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
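The cluster-drift fix and the warning can be illustrated with a small sketch. It is not the project's _dedup_letter_variants: the variant dict layout, the function signature, and the GeneratorWarning definition below are assumptions made for the example. The point it shows is that each cluster keeps the hash of its first member as a fixed centre, so replacing the representative with a higher-ink_ratio candidate never moves the centre.

```python
import warnings


class GeneratorWarning(UserWarning):
    """Stand-in for the project's warning class (hypothetical definition)."""


def hamming_distance(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def dedup_letter_variants(letter: str, variants: list[dict], threshold: int = 10) -> list[dict]:
    """Greedy near-duplicate pass with fixed cluster centres.

    Assumed variant shape: {"_dhash": int, "quality": {"ink_ratio": float}, ...}.
    """
    clusters: list[list] = []  # each entry is [centre_hash, representative]
    for variant in variants:
        for cluster in clusters:
            if hamming_distance(variant["_dhash"], cluster[0]) <= threshold:
                best = cluster[1]
                if variant["quality"]["ink_ratio"] > best["quality"]["ink_ratio"]:
                    cluster[1] = variant  # new representative; centre unchanged
                break
        else:
            clusters.append([variant["_dhash"], variant])

    survivors = [representative for _, representative in clusters]
    dropped = len(variants) - len(survivors)
    if dropped:
        warnings.warn(
            f"letter {letter!r}: dropped {dropped} near-duplicate variant(s)",
            GeneratorWarning,
            stacklevel=2,
        )
    return survivors
```

With hashes A = 0, B ten bits away from A, and C ten bits away from B but twenty from A, B replaces A as the representative while C starts its own cluster. If the centre moved to B's hash on substitution, C would be absorbed as well, which is the A->B->C chain described above.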
compute_dhash now uses NumPy (np.packbits) for vectorised hashing. The mypy CI job installs only .[typecheck] (no cv extra), so NumPy was not available and mypy reported import-not-found. NumPy 1.20+ ships its own py.typed stubs, so adding it to typecheck gives real type coverage rather than a suppressed import.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
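A vectorised difference hash built on np.packbits might look like the sketch below. It assumes the glyph crop has already been resized to a `(hash_size, hash_size + 1)` grid, since the resize step is not described here, and it sets a bit when the right-hand pixel is brighter than its left neighbour; the project's actual comparison direction and bit order may differ.

```python
import numpy as np


def dhash_from_grid(grid: np.ndarray) -> int:
    """Difference hash of an already-resized grid of shape (hash_size, hash_size + 1).

    Assumptions: resizing happens elsewhere, and a bit is 1 where the pixel
    to the right is strictly greater than the pixel to its left.
    """
    if grid.ndim != 2 or grid.shape[0] < 1 or grid.shape[1] != grid.shape[0] + 1:
        raise ValueError("grid must have shape (hash_size, hash_size + 1) with hash_size >= 1")
    bits = grid[:, 1:] > grid[:, :-1]   # horizontal gradients as booleans
    packed = np.packbits(bits.ravel())  # most significant bit first, zero-padded to a whole byte
    padding = -bits.size % 8            # number of padding bits added by packbits
    return int.from_bytes(packed.tobytes(), "big") >> padding
```

The comparison and the packing each run as a single array operation, so no Python-level loop touches individual elements. For the default hash_size of 8 the result is a 64-bit integer.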
Summary
Implements M4: per-variant ink-ratio quality metric and greedy dHash near-duplicate deduplication.
Changes
extractor.py: three new public functions:
- compute_ink_ratio(binary, glyph) → float: fraction of ink pixels in the bbox; range [0.0, 1.0]
- compute_dhash(binary, glyph, *, hash_size=8) → int: 64-bit difference hash (perceptual hash based on horizontal pixel gradients)
- hamming_distance(a, b) → int: bit-wise Hamming distance between two integer hashes
letter_set.schema.json: quality is now a required field on every variant, with quality.ink_ratio required.
generator.py: two pipeline additions per writer:
1. Compute ink_ratio and _dhash for each glyph crop after binarisation.
2. Greedy dedup pass (_dedup_letter_variants) per letter: collapses near-duplicates (Hamming ≤ 10), keeping the variant with the highest ink_ratio. The internal _dhash key is stripped before the document is written.
Tests: 20 new test cases, including tests for compute_ink_ratio and compute_dhash.
examples/letter_set/writer_example.json: all four fixture variants updated with plausible quality.ink_ratio values.
docs/design/quality-and-dedupe.md: design record covering algorithm spec, decision table, schema changes, and known limitations.
Key design decisions
- Quality metric: ink_ratio (ink fraction from existing binary array; no extra deps)
- Dedup representative: the variant with the highest ink_ratio wins
CI
All 167 tests pass, 91.5% coverage, ruff clean, mypy strict clean.
Closes #22
🤖 Generated with Claude Code